Why do some red wines taste better than others? Is it only because wine tasters say so, or is there another way to tell? Can we tell what makes a great or a bad wine from its chemical properties? And if so, under what conditions is the quality of red wine best?
This is what we are going to explore: the relationship between chemical properties and wine quality.
The analysis includes: data structure, statistical summary, distribution plots, boxplots of each variable vs. quality, correlation matrix and scatter plots, final plots exploring the strongly correlated variables, and reflections.
The data set used in this analysis can be found here: https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityInfo.txt
First, let’s see how many samples the wine data contains:
## [1] 1599
The data set has 1599 samples.
Next, let’s list all the variables.
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
X is the data entry number and quality is the outcome of the analysis. That leaves 11 chemical properties as input variables. The data is in wide format.
What about the structure of the data?
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
Quality was stored as an integer (as was the entry number X). All other variables were numeric.
A statistical summary of the data is shown below.
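The row count, variable names and structure shown above typically come from `nrow()`, `names()` and `str()`. The following is a minimal sketch on a stand-in data frame built from the first three rows of a few columns in the `str()` output; the name `redwine_head` is hypothetical, and the real analysis would load the full file instead (its file name is not given here).

```r
# Stand-in for the full data set: the first three rows of a few columns,
# copied from the str() output above. The name redwine_head is hypothetical.
redwine_head <- data.frame(
  X                = 1:3,
  fixed.acidity    = c(7.4, 7.8, 7.8),
  volatile.acidity = c(0.70, 0.88, 0.76),
  alcohol          = c(9.4, 9.8, 9.8),
  quality          = c(5L, 5L, 5L)
)

n_samples <- nrow(redwine_head)           # number of samples (rows)
var_names <- names(redwine_head)          # column names
str(redwine_head)                         # type and first values per column
col_types <- sapply(redwine_head, class)  # storage class of each column
col_types
```

On the full data the same calls would report 1599 rows and 13 columns.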
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
Quality ranged from 3 to 8. Residual.sugar, chlorides, free.sulfur.dioxide and total.sulfur.dioxide had very wide ranges. Do these variables influence wine quality?
First, let us explore the distribution of each variable using ggplot.
The data is in wide format, which makes it difficult for R to draw plots of multiple variables. Therefore, I reshaped the data into long format.
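# load the packages used below (melt() is assumed to come from reshape2)
library(reshape2)
library(ggplot2)
# reshape data into long format
long_data <- melt(redwine, id.vars = c("X", "quality"))
# plot the histogram (on the density scale) with a density curve on top
dist_plot <- ggplot(long_data, aes(x = value)) +
  geom_histogram(aes(y = ..density..),
                 binwidth = 0.05, colour = "green", fill = "white") +
  geom_density(color = "red", alpha = 0.2)
# draw one panel per variable, each with its own axis scales
dist_plot + facet_wrap(~ variable, scales = "free")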
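To make the wide-to-long reshape concrete, here is a minimal base-R sketch on a hypothetical two-row data frame. It mimics what `melt()` does with `id.vars = c("X", "quality")`, but it is an illustration only, not the reshape2 implementation; the names `wide`, `long` and `measure_cols` are made up for this example.

```r
# Hypothetical wide data: one row per sample, one column per property
wide <- data.frame(X = 1:2, quality = c(5L, 5L),
                   pH = c(3.51, 3.20), alcohol = c(9.4, 9.8))

# Every column that is not an id column becomes a (variable, value) pair
measure_cols <- setdiff(names(wide), c("X", "quality"))
long <- do.call(rbind, lapply(measure_cols, function(v) {
  data.frame(X = wide$X, quality = wide$quality,
             variable = v, value = wide[[v]])
}))
long
```

Each sample now occupies one row per measured property, which is the shape `facet_wrap(~ variable)` needs to draw one panel per property.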
Some of the variables seem to follow a normal distribution, such as density, pH and fixed.acidity, while a few others are right-skewed, such as residual.sugar, free.sulfur.dioxide, total.sulfur.dioxide, sulphates and alcohol.
Most of the wine samples had a quality of 5 or 6. Let’s get the exact number.
# calculate the % of wine with quality 5 and 6
100*count(subset(redwine, quality == 5 | quality == 6))/length(redwine$quality)
## n
## 1 82.48906
About 82.49% of the wines had a quality of 5 or 6.
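The same share can also be computed in base R without `count()`, because the mean of a logical vector is the proportion of `TRUE` values. A minimal sketch, using a hypothetical vector of quality scores rather than the real data:

```r
# Hypothetical quality scores, for illustration only
quality <- c(5, 5, 6, 7, 3, 6, 5, 8)

# %in% gives TRUE/FALSE per wine; the mean of that is the share of TRUEs
pct_mid <- 100 * mean(quality %in% c(5, 6))
pct_mid
```

Applied to `redwine$quality`, this expression should reproduce the percentage reported above.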
Let us compute the correlation matrix with ggpairs to see which chemical properties have strong relationships with wine quality and with each other. It was difficult to run ggpairs on all variables at once because the space allotted to the plot couldn’t hold a 12 x 12 grid of panels, so I created three groups and made sure that the variable “quality” (col 13) was present in each.
We learned that any correlation above 0.3 (in absolute value) is meaningful and above 0.7 is pretty strong. Let us see if we can find any in the results below.
group1 <- ggpairs(redwine[c(13, 2:5)])
group1
The correlation coefficient between quality and volatile.acidity was -0.391, between citric.acid and fixed.acidity 0.672, and between citric.acid and volatile.acidity -0.552.
group2 <- ggpairs(redwine[c(13, 6:8)])
group2
The correlation coefficient between total.sulfur.dioxide and free.sulfur.dioxide was 0.668.
group3 <- ggpairs(redwine[c(13, 9:12)])
group3
The correlation coefficient between quality and alcohol was 0.476, and between pH and density -0.342.
From the above correlation analysis, I found that only alcohol and volatile.acidity had correlation coefficients larger than 0.3 in absolute value with quality. Since we are interested in what makes the best wine, it is important to also consider other chemical properties that may have some impact.
Let’s look at the correlation of every chemical property with quality.
## [,1]
## fixed.acidity 0.12405165
## volatile.acidity -0.39055778
## citric.acid 0.22637251
## residual.sugar 0.01373164
## chlorides -0.12890656
## free.sulfur.dioxide -0.05065606
## total.sulfur.dioxide -0.18510029
## density -0.17491923
## pH -0.05773139
## sulphates 0.25139708
## alcohol 0.47616632
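A one-column table like the one above could be produced with `cor()`, passing the chemical properties as `x` and quality as `y`. The following is a sketch under that assumption, run on only the first ten rows of three columns (copied from the `str()` output earlier), so its numbers will differ from the full-data values; `head10` and `predictors` are hypothetical names.

```r
# First ten rows of three properties plus quality, from the str() output
head10 <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.80),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)
predictors <- setdiff(names(head10), "quality")

# With a data frame as x and a vector as y, cor() returns a one-column
# matrix of Pearson correlations, one row per column of x
cor_with_quality <- cor(head10[, predictors], head10$quality)
cor_with_quality
```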
We could see that 6 chemical properties (volatile.acidity, total.sulfur.dioxide, pH, free.sulfur.dioxide, density, chlorides) were negatively correlated with quality, which suggests that higher levels of them go with worse-tasting wine. Among these, volatile.acidity had the strongest correlation, at -0.391. Sulphates, residual.sugar, fixed.acidity, citric.acid and alcohol were positively correlated with quality, suggesting that they go with better-tasting wine. Among these, sulphates, citric.acid and alcohol had the strongest correlations, at 0.251, 0.226 and 0.476 respectively.
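The grouping by sign and the 0.3 rule of thumb can be checked directly against the correlations printed above. A small sketch, with the values copied from that output into a named vector (the names `cor_q`, `negative`, `positive` and `meaningful` are made up for this example):

```r
# Correlations with quality, copied from the output above
cor_q <- c(fixed.acidity = 0.12405165, volatile.acidity = -0.39055778,
           citric.acid = 0.22637251, residual.sugar = 0.01373164,
           chlorides = -0.12890656, free.sulfur.dioxide = -0.05065606,
           total.sulfur.dioxide = -0.18510029, density = -0.17491923,
           pH = -0.05773139, sulphates = 0.25139708, alcohol = 0.47616632)

negative   <- sort(cor_q[cor_q < 0])                     # most negative first
positive   <- sort(cor_q[cor_q > 0], decreasing = TRUE)  # most positive first
meaningful <- names(cor_q)[abs(cor_q) > 0.3]             # 0.3 rule of thumb

names(negative)
names(positive)
meaningful
```

Only volatile.acidity and alcohol pass the 0.3 threshold; sulphates and citric.acid are the next largest positive correlations.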
From the boxplots, it looked like alcohol, sulphates, volatile.acidity and citric.acid might have an impact on the quality of wines. This was consistent with the previous correlation analysis.
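The boxplot code is not shown here. As a rough illustration, a boxplot of one property against quality can be drawn with base graphics as follows; this is a sketch on the first ten rows only (from the `str()` output), and the original plots may well have been made with ggplot2 instead.

```r
# First ten rows of alcohol and quality, from the str() output
head10 <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# One box per quality score; boxplot() also returns the plotted statistics
bp <- boxplot(alcohol ~ quality, data = head10,
              xlab = "quality", ylab = "alcohol")
bp$names  # quality levels that received a box
bp$n      # number of samples behind each box
```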
Let’s take a closer look at the plots for these chemical properties.
As wine quality increased from 3 to 8, the average alcohol content increased, except at quality 5. We could also see that wines with quality 5 had many outliers.
Let’s compare the distributions of alcohol for the different wine qualities.
The distributions of alcohol were similar and almost normal for all wine qualities except 5, where the distribution was much narrower.
Let’s see the summary of alcohol for quality 5.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.5 9.4 9.7 9.9 10.2 14.9
The mean of alcohol for quality 5 was 9.90.
Let’s compare with the other qualities.
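A summary like the one above can be obtained by subsetting on quality first. A minimal sketch on the first ten rows from the `str()` output (so the numbers differ from the full-data summary); `head10` and `alcohol_q5` are hypothetical names.

```r
# First ten rows of alcohol and quality, from the str() output
head10 <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Keep only the wines with quality 5, then summarise their alcohol content
alcohol_q5 <- subset(head10, quality == 5)$alcohol
summary(alcohol_q5)
mean_q5 <- mean(alcohol_q5)
```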
quality_vs_alcohol <- redwine %>%
group_by(quality) %>%
summarize(avg_alcohol = mean(alcohol)) %>%
arrange(avg_alcohol)
quality_vs_alcohol
## Source: local data frame [6 x 2]
##
## quality avg_alcohol
## (int) (dbl)
## 1 5 9.899706
## 2 3 9.955000
## 3 4 10.265094
## 4 6 10.629519
## 5 7 11.465913
## 6 8 12.094444
The average alcohol increased from 9.955 to 12.094 (1.2 times) when wine quality increased from 3 to 8, except for quality 5, where the average alcohol was 9.900.
As wine quality increased from 3 to 8, the average citric.acid also increased.
Let’s compare the distributions of citric.acid for the different wine qualities.
We could see the mean of citric.acid shift to the right as wine quality increased.
Let’s summarize and sort the mean of citric.acid.
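The ratio between the two ends of the quality scale can be checked from the group means printed above. A short sketch, with the means copied into a named vector ordered by quality (the variable names are made up for this example):

```r
# Mean alcohol per quality score, copied from the output above
avg_alcohol <- c("3" = 9.955000, "4" = 10.265094, "5" = 9.899706,
                 "6" = 10.629519, "7" = 11.465913, "8" = 12.094444)

# Ratio of the mean at quality 8 to the mean at quality 3
fold_3_to_8 <- avg_alcohol[["8"]] / avg_alcohol[["3"]]
round(fold_3_to_8, 1)

# Quality score with the lowest mean alcohol
lowest_quality <- names(avg_alcohol)[which.min(avg_alcohol)]
lowest_quality
```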
quality_vs_citric.acid <- redwine %>%
group_by(quality) %>%
summarize(avg_citric.acid = mean(citric.acid)) %>%
arrange(avg_citric.acid)
quality_vs_citric.acid
## Source: local data frame [6 x 2]
##
## quality avg_citric.acid
## (int) (dbl)
## 1 3 0.1710000
## 2 4 0.1741509
## 3 5 0.2436858
## 4 6 0.2738245
## 5 7 0.3751759
## 6 8 0.3911111
The average value of citric.acid clearly increased from 0.171 to 0.391 (2.3 times) when quality increased from 3 to 8.
As wine quality increased from 3 to 8, the average sulphates also increased.
Let’s compare the distributions of sulphates for the different wine qualities.
We could see that the distributions of sulphates were similar and that the mean of sulphates shifted to the right as wine quality increased.
Let’s summarize and sort the mean of sulphates.
quality_vs_sulphates <- redwine %>%
group_by(quality) %>%
summarize(avg_sulphates = mean(sulphates)) %>%
arrange(avg_sulphates)
quality_vs_sulphates
## Source: local data frame [6 x 2]
##
## quality avg_sulphates
## (int) (dbl)
## 1 3 0.5700000
## 2 4 0.5964151
## 3 5 0.6209692
## 4 6 0.6753292
## 5 7 0.7412563
## 6 8 0.7677778
The average value of sulphates clearly increased from 0.570 to 0.768 (1.3 times) when quality increased from 3 to 8.
As wine quality increased from 3 to 8, the average volatile.acidity decreased.
Let’s compare the distributions of volatile.acidity for the different wine qualities.
We could see that the distributions of volatile.acidity were similar and that the mean of volatile.acidity shifted to the left as wine quality increased.
Let’s summarize and sort the mean of volatile.acidity.
quality_vs_volatile.acidity <- redwine %>%
group_by(quality) %>%
summarize(avg_volatile.acidity = mean(volatile.acidity)) %>%
arrange(avg_volatile.acidity)
quality_vs_volatile.acidity
## Source: local data frame [6 x 2]
##
## quality avg_volatile.acidity
## (int) (dbl)
## 1 7 0.4039196
## 2 8 0.4233333
## 3 6 0.4974843
## 4 5 0.5770411
## 5 4 0.6939623
## 6 3 0.8845000
The average value of volatile.acidity clearly decreased from 0.884 at quality 3 to 0.423 at quality 8 (a factor of about 2.1); the lowest average, 0.404, was at quality 7.
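Because the table above is sorted by the mean rather than by quality, it is worth checking whether the decrease is strictly monotonic and where the minimum sits. A small sketch with the means copied from the output and reordered by quality (variable names are made up for this example):

```r
# Mean volatile.acidity per quality score, copied from the output above
avg_va <- c("3" = 0.8845000, "4" = 0.6939623, "5" = 0.5770411,
            "6" = 0.4974843, "7" = 0.4039196, "8" = 0.4233333)

# TRUE only if every step up in quality lowers the mean
strictly_decreasing <- all(diff(avg_va) < 0)

# Quality score with the lowest mean, and the ratio between both ends
quality_at_min <- names(avg_va)[which.min(avg_va)]
ratio_3_to_8   <- avg_va[["3"]] / avg_va[["8"]]

strictly_decreasing
quality_at_min
round(ratio_3_to_8, 2)
```

The check shows a small uptick between quality 7 and quality 8, so the trend is downward overall but not strictly monotonic.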
There were 1599 wine samples with quality ranging from 3 to 8. About 82.49% of the wines had a quality of 5 or 6.
There were meaningful correlations among the chemical properties, such as citric.acid with fixed.acidity (0.672), citric.acid with volatile.acidity (-0.552), total.sulfur.dioxide with free.sulfur.dioxide (0.668), and pH with density (-0.342).
There were also correlations between some chemical properties and quality: volatile.acidity (-0.391) and alcohol (0.476) exceeded the 0.3 threshold, while sulphates (0.251) and citric.acid (0.226) were weaker.
library(data.table)
# turn the data into a data.table
wine_table <- data.table(redwine)
# add a new rating variable: bad (<= 4), good (5 to 6), very good (>= 7)
wine_table[, rating := ifelse(quality <= 4, "bad",
                       ifelse(quality >= 5 & quality <= 6, "good",
                       ifelse(quality >= 7, "very good", NA)))]
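As an aside, the same three-way binning can be written without nested `ifelse()` calls by using base R's `cut()`. This is an alternative sketch, not the method used above, shown on the six quality scores that occur in the data:

```r
# The quality scores that occur in the data
quality <- 3:8

# Intervals are (-Inf, 4], (4, 6] and (6, Inf), i.e. <= 4, 5 to 6, >= 7
rating <- cut(quality, breaks = c(-Inf, 4, 6, Inf),
              labels = c("bad", "good", "very good"))
data.frame(quality, rating)
```

Note that `cut()` returns a factor with the levels in the order given, whereas the nested `ifelse()` returns a character vector.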
Let’s summarize the wine by rating.
wine_table %>%
group_by(rating) %>%
summarize(n_obs = n())
## Source: local data table [3 x 2]
##
## rating n_obs
## (chr) (int)
## 1 good 1319
## 2 very good 217
## 3 bad 63
So, there were 217 very good wines, 1319 good wines and 63 bad wines.
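These counts can be cross-checked against the sample size and the share of quality 5 or 6 wines reported earlier. A short sketch using the counts from the table above (the names `n_obs`, `total` and `share_pct` are made up for this example):

```r
# Number of wines per rating, copied from the output above
n_obs <- c(bad = 63, good = 1319, "very good" = 217)

total     <- sum(n_obs)          # should equal the number of samples
share_pct <- 100 * n_obs / total # share of each rating in percent
total
round(share_pct, 2)
```

The "good" share matches the percentage of wines with quality 5 or 6, as expected from how the rating was defined.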
## Source: local data frame [6 x 12]
##
## quality avg_alcohol avg_citric.acid avg_sulphates avg_volatile.acidity
## (int) (dbl) (dbl) (dbl) (dbl)
## 1 5 9.899706 0.2436858 0.6209692 0.5770411
## 2 3 9.955000 0.1710000 0.5700000 0.8845000
## 3 4 10.265094 0.1741509 0.5964151 0.6939623
## 4 6 10.629519 0.2738245 0.6753292 0.4974843
## 5 7 11.465913 0.3751759 0.7412563 0.4039196
## 6 8 12.094444 0.3911111 0.7677778 0.4233333
## Variables not shown: avg_fixed.acidity (dbl), avg_pH (dbl),
## avg_residual.sugar (dbl), avg_density (dbl), avg_total.sulfur.dioxide
## (dbl), avg_free.sulfur.dioxide (dbl), avg_chlorides (dbl)
## Source: local data table [3 x 12]
##
## rating avg_alcohol avg_citric.acid avg_sulphates avg_volatile.acidity
## (chr) (dbl) (dbl) (dbl) (dbl)
## 1 bad 10.21587 0.1736508 0.5922222 0.7242063
## 2 good 10.25272 0.2582638 0.6472631 0.5385595
## 3 very good 11.51805 0.3764977 0.7434562 0.4055300
## Variables not shown: avg_fixed.acidity (dbl), avg_pH (dbl),
## avg_residual.sugar (dbl), avg_density (dbl), avg_total.sulfur.dioxide
## (dbl), avg_free.sulfur.dioxide (dbl), avg_chlorides (dbl)
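The code that built these two summary tables is not shown. One way to get per-group means of every column at once is base R's `aggregate()`; the following is a sketch under that assumption, on the first ten rows of two properties (from the `str()` output), with the `avg_` prefix added to mirror the column names above.

```r
# First ten rows of two properties plus quality, from the str() output
head10 <- data.frame(
  alcohol     = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  citric.acid = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  quality     = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Mean of every other column within each quality score
avg_by_quality <- aggregate(. ~ quality, data = head10, FUN = mean)

# Prefix the averaged columns with "avg_", as in the tables above
names(avg_by_quality)[-1] <- paste0("avg_", names(avg_by_quality)[-1])
avg_by_quality
```

Replacing `quality` with a `rating` column in the formula would give the per-rating version of the table.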
We could clearly see the trend: the higher the wine rating, the higher both avg_fixed.acidity and avg_citric.acid were. This is consistent with fixed.acidity and citric.acid being strongly correlated with each other (correlation coefficient 0.672) and both being positively correlated with quality (0.124 and 0.226 respectively).
ggplot(quality_vs_total_variables, aes(x = avg_total.sulfur.dioxide, y = avg_free.sulfur.dioxide,
color = as.factor(quality))) +
geom_point()
ggplot(rating_vs_total_variables, aes(x = avg_total.sulfur.dioxide, y = avg_free.sulfur.dioxide,
color = as.factor(rating))) +
geom_point()
We could see the correlation between free.sulfur.dioxide and total.sulfur.dioxide. It was interesting to note that wine quality was best in the middle range of both chemical properties (14 and 35 respectively). This suggests that low concentrations of these chemicals go with poor-tasting wine, but that too much of them also reduces wine quality. It is also consistent with the two chemicals not being well correlated with quality.
In short, they are correlated with quality, but not very well.